The Boston data set ships with R (in the MASS package) as a demonstration data set.
It contains social, environmental and economic information about the greater Boston area.
The data set has 506 observations of 14 variables, and all variables are numeric (chas and rad are stored as integers). Its structure is:
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
[1] 506 14
As the pairs plot shows, most of the variables are not normally distributed: most are skewed and some are bimodal. Correlations between variables are easier to read from the correlation plot, where the largest circles in the upper-right triangle mark the strongest correlations (blue = positive, red = negative). The corresponding numeric values are mirrored in the lower-left triangle.
crim zn indus chas
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
nox rm age dis
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
rad tax ptratio black
Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
lstat medv
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00
After standardization the mean of every variable is zero, so the values of each variable are distributed around zero. This can be seen in the summary table below (compare with the original summary above).
The crime rate variable has been converted to a categorical variable with four levels: low, med_low, med_high and high. Each level covers one quartile (25%) of the data.
Train and test sets have been created by randomly dividing the standardized data into two groups: 80% of the observations go to the train set and 20% to the test set.
crim zn indus
Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
Median :-0.390280 Median :-0.48724 Median :-0.2109
Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
chas nox rm age
Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
dis rad tax ptratio
Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
black lstat medv
Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
Median : 0.3808 Median :-0.1811 Median :-0.1449
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
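The three preprocessing steps described above (standardizing, binning the crime rate into quartiles, and the random 80/20 split) could be sketched as follows. This is a minimal sketch and not necessarily the code used for this report: the seed, the use of base R only, and the names bins, crime and ind are assumptions made for illustration.

```r
library(MASS)   # provides the Boston data set

data("Boston")

# Standardize: every column gets mean 0 and standard deviation 1
boston_scaled <- as.data.frame(scale(Boston))

# Bin the scaled crime rate at its quartiles into four labelled classes
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# Replace the numeric crime rate by the categorical one
boston_scaled$crim <- NULL
boston_scaled$crime <- crime

# Random 80/20 split (the seed is an assumption, for reproducibility only)
set.seed(123)
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(0.8 * n))
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]

dim(train)
dim(test)
```

With 506 observations this split gives 404 training rows and 102 test rows. Which rows end up in each set depends on the seed.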
In the linear discriminant analysis (LDA) only the train set (80% of the data) is analysed. The target variable is the new categorical variable crime (low, med_low, med_high, high). All other variables of the data set are used as predictors in the LDA model (see Overview of data).
In the biplot below, the variable “rad” (index of accessibility to radial highways) has a far larger influence on LD1 than any other variable, and it also has the largest coefficient on LD2. In the biplot the horizontal component of each arrow shows the variable's contribution to LD1 (x-axis) and the vertical component its contribution to LD2 (y-axis). The sign of the coefficient of the linear discriminant determines the direction of the arrow, and the longer the arrow, the bigger the influence. Most of the variables contribute to both LD1 and LD2, so their arrows point at various angles between the two axes. For example, in the LDA table below the most significant variable of LD1, “rad”, has coefficients LD1 = 3.39 and LD2 = 0.94. These can be read directly as the coordinates of the arrow head. Similarly “nox”, one of the more influential variables on LD2, has its arrow head at (0.37, -0.71). LD1 alone accounts for about 96% of the between-group variance (proportion of trace 0.9574), LD2 for about 3% and LD3 for only about 1%.
Call:
lda(crime ~ ., data = train)
Prior probabilities of groups:
low med_low med_high high
0.2500000 0.2425743 0.2400990 0.2673267
Group means:
zn indus chas nox rm
low 1.01303506 -0.9021772 -0.07742312 -0.8848953 0.4417843
med_low -0.08559517 -0.3204814 -0.07145661 -0.5442241 -0.1135139
med_high -0.39794966 0.1101991 0.25532354 0.3214000 0.1730271
high -0.48724019 1.0169921 -0.05360128 1.0489936 -0.4165842
age dis rad tax ptratio
low -0.8875603 0.8687725 -0.6919117 -0.7516605 -0.45371408
med_low -0.3503387 0.3291057 -0.5470944 -0.4819529 -0.04450455
med_high 0.3873862 -0.3617872 -0.4147416 -0.3333111 -0.32612786
high 0.7940623 -0.8431387 1.6393984 1.5149640 0.78225547
black lstat medv
low 0.37218460 -0.76515982 0.534376732
med_low 0.31110060 -0.16121224 0.007971653
med_high 0.06422589 -0.05582227 0.219721545
high -0.68602652 0.84336094 -0.633193558
Coefficients of linear discriminants:
LD1 LD2 LD3
zn 0.07350369 0.84491544 -0.935055068
indus 0.03935947 -0.15415084 0.171588821
chas -0.10357515 -0.05540923 -0.001581682
nox 0.37321704 -0.70760452 -1.219772012
rm -0.12274728 -0.10220041 -0.163574095
age 0.23300571 -0.30332149 -0.292182841
dis -0.05474661 -0.27656650 0.120967851
rad 3.38621306 0.94414648 -0.196041805
tax 0.05122457 -0.08373380 0.682267768
ptratio 0.08044487 0.06951790 -0.123032407
black -0.09081145 0.04681548 0.168583693
lstat 0.23072032 -0.19668229 0.318314245
medv 0.20300555 -0.35807830 -0.205483232
Proportion of trace:
LD1 LD2 LD3
0.9574 0.0329 0.0097
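A fit like the one printed above, together with its proportion of trace and the arrow-head coordinates of the biplot, could be obtained along these lines. The sketch assumes the preprocessing described earlier and an arbitrary seed, so its numbers will differ from the printed output, which comes from a different random split. The names prop_trace and head_coords are illustrative.

```r
library(MASS)

data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

set.seed(123)   # assumption: any seed will do
n <- nrow(boston_scaled)
train <- boston_scaled[sample(n, size = floor(0.8 * n)), ]

# LDA with crime as target and all remaining variables as predictors
lda.fit <- lda(crime ~ ., data = train)

# Proportion of trace: squared singular values, normalised to sum to 1
prop_trace <- lda.fit$svd^2 / sum(lda.fit$svd^2)
round(prop_trace, 4)

# With myscale = 1 the arrow heads of the biplot are the LD1/LD2 coefficients
head_coords <- lda.fit$scaling[, 1:2]

# Variables ordered by arrow length in the LD1-LD2 plane
sort(sqrt(rowSums(head_coords^2)), decreasing = TRUE)
```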
In the test data set the categorical crime variable has been removed before prediction. In the table below the true classes of the original test data (rows) and the classes predicted for the test data without crime (columns) are cross-tabulated. The total number of observations is 102 (the 20% of the 506 observations not used for training). The diagonal of the table (from the top-left corner) holds the correctly predicted observations (sum = 71) and all other cells hold misclassified observations (sum = 31). The prediction error is 31/102 ≈ 0.30.
predicted
correct low med_low med_high high Sum
low 11 13 2 0 26
med_low 3 22 3 0 28
med_high 1 7 19 2 29
high 0 0 0 19 19
Sum 15 42 24 21 102
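The error rate can be checked directly from the cross-tabulation. The sketch below copies the counts from the table above into a matrix; the names lv and conf_mat are illustrative only.

```r
# Counts copied from the table above (rows = correct, columns = predicted)
lv <- c("low", "med_low", "med_high", "high")
conf_mat <- matrix(c(11, 13,  2,  0,
                      3, 22,  3,  0,
                      1,  7, 19,  2,
                      0,  0,  0, 19),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(correct = lv, predicted = lv))

n_total    <- sum(conf_mat)         # 102 test observations
n_correct  <- sum(diag(conf_mat))   # 71 on the diagonal
n_wrong    <- n_total - n_correct   # 31 off the diagonal
error_rate <- n_wrong / n_total     # 31/102, about 0.30

# Share of each true class that was predicted correctly
round(diag(conf_mat) / rowSums(conf_mat), 2)
```

The per-class shares show that the high class is predicted without error in this split, while low is most often confused with med_low.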
For the clustering, the Euclidean distance matrix of the standardized data has been calculated, and its summary is shown in the table below. The optimal number of clusters for the k-means algorithm can be investigated by looking at the total within-cluster sum of squares (TWSS) as a function of the number of clusters: the point where TWSS drops sharply indicates a suitable number of clusters. In this case the optimal number of clusters is 2 or 3. In the first plot the data have been classified into two clusters and in the second plot into three.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
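The distance summary and the TWSS inspection described above could be produced roughly as follows. This is a sketch under assumptions: the seed, the upper limit of 10 clusters and the names dist_eu, k_max and twss are not taken from the report.

```r
library(MASS)

data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# Euclidean distances between all pairs of observations (dist() default)
dist_eu <- dist(boston_scaled)
summary(dist_eu)

# Total within-cluster sum of squares for k = 1, ..., k_max
set.seed(123)   # k-means starts from random centres
k_max <- 10
twss <- sapply(1:k_max, function(k) {
  kmeans(boston_scaled, centers = k)$tot.withinss
})

# A sharp drop in the curve suggests a suitable number of clusters
plot(1:k_max, twss, type = "b",
     xlab = "number of clusters", ylab = "TWSS")
```

Because k-means depends on its random starting centres, the curve can change a little between runs. Fixing the seed keeps the plot reproducible.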
Here LDA is calculated with the k-means clusters as target classes. All other variables in the Boston data are predictor variables. The LDA tables and biplots show how the results differ with the number of clusters. Variable “rad” is the most influential linear separator for the clusters on LD1 and variable “zn” on LD2. At the moment knitting does not accept my code, so the code is shown below as text. I will try to fix this later.
library(MASS)

data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}

# k-means with 3 to 6 clusters
km3 <- kmeans(boston_scaled, centers = 3)
km4 <- kmeans(boston_scaled, centers = 4)
km5 <- kmeans(boston_scaled, centers = 5)
km6 <- kmeans(boston_scaled, centers = 6)

# cluster memberships as factors (the target classes)
clu3 <- as.factor(km3$cluster)
clu4 <- as.factor(km4$cluster)
clu5 <- as.factor(km5$cluster)
clu6 <- as.factor(km6$cluster)

# LDA with the clusters as target classes
lda.fit3 <- lda(clu3 ~ ., data = boston_scaled)
lda.fit3
lda.fit4 <- lda(clu4 ~ ., data = boston_scaled)
lda.fit4
lda.fit5 <- lda(clu5 ~ ., data = boston_scaled)
lda.fit5
lda.fit6 <- lda(clu6 ~ ., data = boston_scaled)
lda.fit6

# biplots: target classes as numeric, arrows drawn from the matching model
# (the calls previously passed lda.fit, which is not defined in this code)
classes <- as.numeric(clu3)
plot(lda.fit3, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit3, myscale = 1)

classes <- as.numeric(clu4)
plot(lda.fit4, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit4, myscale = 1)

classes <- as.numeric(clu5)
plot(lda.fit5, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit5, myscale = 1)

classes <- as.numeric(clu6)
plot(lda.fit6, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit6, myscale = 1)
Adjust the code: add the argument color to the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities?
The data points are, of course, in the same positions in both plots. The grouping within the main group differs slightly depending on whether the colours encode the crime class or the cluster. In the separate group the high-crime class is well isolated, whereas the k-means colouring splits that group into two clusters. When the colours encode crime, the high-crime observations in particular are gathered more clearly into one group.
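The two 3D plots compared above could be drawn roughly as sketched here: the training predictors are projected onto the three linear discriminants and the points are coloured once by crime class and once by k-means cluster. The seed, the choice of four clusters (to mirror the four crime classes) and the names model_predictors, matrix_product and km are assumptions. The plotly calls are guarded because the package may not be installed.

```r
library(MASS)

data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

set.seed(123)   # assumption: any seed will do
n <- nrow(boston_scaled)
train <- boston_scaled[sample(n, size = floor(0.8 * n)), ]

lda.fit <- lda(crime ~ ., data = train)

# Project the predictors onto LD1, LD2 and LD3
model_predictors <- train[, colnames(train) != "crime"]
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling

# k-means on the same predictors (4 clusters is an assumption)
km <- kmeans(model_predictors, centers = 4)

if (requireNamespace("plotly", quietly = TRUE)) {
  # Same coordinates in both plots, only the colouring differs
  p_crime <- plotly::plot_ly(x = matrix_product[, 1], y = matrix_product[, 2],
                             z = matrix_product[, 3], type = "scatter3d",
                             mode = "markers", color = train$crime)
  p_clust <- plotly::plot_ly(x = matrix_product[, 1], y = matrix_product[, 2],
                             z = matrix_product[, 3], type = "scatter3d",
                             mode = "markers", color = factor(km$cluster))
}
```

Since both plots use the same matrix_product coordinates, any difference between them comes only from the colouring variable.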